Photo-realistic face synthesis and reenactment with deep generative models
The advent of Deep Learning has led to numerous breakthroughs in the field of Computer Vision. Over the last decade, a significant amount of research has been undertaken towards designing neural networks for visual data analysis. At the same time, rapid advancements have been made towards the direction of deep generative modeling, especially after the introduction of Generative Adversarial Networks (GANs), which have shown particularly promising results when it comes to synthesising visual data. Since then, considerable attention has been devoted to the problem of photo-realistic human face animation due to its wide range of applications, including image and video editing, virtual assistance, social media, teleconferencing, and augmented reality. The objective of this thesis is to make progress towards generating photo-realistic videos of human faces. To that end, we propose novel generative algorithms that provide explicit control over the facial expression and head pose of synthesised subjects. Despite the major advances in face reenactment and motion transfer, current methods struggle to generate video portraits that are indistinguishable from real data. In this work, we aim to overcome the limitations of existing approaches by combining concepts from deep generative networks and video-to-video translation with 3D face modelling, and more specifically by capitalising on prior knowledge of faces that is enclosed within statistical models such as 3D Morphable Models (3DMMs). In the first part of this thesis, we introduce a person-specific system that performs full head reenactment using ideas from video-to-video translation. Subsequently, we propose a novel approach to controllable video portrait synthesis, inspired by Implicit Neural Representations (INR).
In the second part of the thesis, we focus on person-agnostic methods and present a GAN-based framework that performs video portrait reconstruction, full head reenactment, expression editing, novel pose synthesis and face frontalisation.
Dynamic Neural Portraits
We present Dynamic Neural Portraits, a novel approach to the problem of
full-head reenactment. Our method generates photo-realistic video portraits by
explicitly controlling head pose, facial expressions and eye gaze. Our proposed
architecture is different from existing methods that rely on GAN-based
image-to-image translation networks for transforming renderings of 3D faces
into photo-realistic images. Instead, we build our system upon a 2D
coordinate-based MLP with controllable dynamics. Our motivation for adopting a
2D-based representation, as opposed to recent 3D NeRF-like systems, stems from
the fact that video portraits are captured by monocular stationary cameras;
therefore, only a single viewpoint of the scene is available. Primarily, we
condition our generative model on expression blendshapes; nonetheless, we show
that our system can be successfully driven by audio features as well. Our
experiments demonstrate that the proposed method is 270 times faster than
recent NeRF-based reenactment methods, with our networks achieving speeds of 24
fps for resolutions up to 1024 x 1024, while outperforming prior works in terms
of visual quality.
Comment: In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 202
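As a rough illustration of the idea in the abstract above, the following is a minimal sketch of a 2D coordinate-based MLP whose output colour depends on both the pixel location and a control vector (standing in for pose, expression and gaze parameters). The sinusoidal positional encoding, the layer sizes, the random untrained weights and the names `positional_encoding`, `init_mlp` and `render` are all illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def positional_encoding(xy, num_freqs=4):
    """Map (N, 2) coordinates in [-1, 1] to sin/cos features (assumed encoding)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (F,)
    angles = xy[:, :, None] * freqs                      # (N, 2, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(xy.shape[0], -1)                # (N, 4 * F)

def init_mlp(in_dim, hidden=64, out_dim=3, seed=0):
    """Random, untrained weights; sizes are placeholders for illustration."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def render(params, controls, height, width):
    """Evaluate the MLP at every pixel, conditioned on one control vector."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    xy = np.stack([xs.ravel(), ys.ravel()], axis=-1)     # (H*W, 2)
    # Every pixel receives the same control vector next to its encoded position.
    h = np.concatenate([positional_encoding(xy),
                        np.tile(controls, (xy.shape[0], 1))], axis=-1)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)                       # ReLU on hidden layers
    rgb = 1.0 / (1.0 + np.exp(-h))                       # squash to [0, 1]
    return rgb.reshape(height, width, 3)

# Usage: 16 positional features (4 frequencies) plus an 8-dimensional control vector.
params = init_mlp(in_dim=4 * 4 + 8)
frame = render(params, controls=np.zeros(8), height=16, width=16)
```

Changing `controls` while keeping `params` fixed changes the rendered frame, which is the sense in which such a representation is controllable; a trained system would learn the weights from a video of the target person.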
Free-HeadGAN: Neural Talking Head Synthesis with Explicit Gaze Control
We present Free-HeadGAN, a person-generic neural talking head synthesis
system. We show that modeling faces with sparse 3D facial landmarks is
sufficient for achieving state-of-the-art generative performance, without
relying on strong statistical priors of the face, such as 3D Morphable Models.
Apart from 3D pose and facial expressions, our method is capable of fully
transferring the eye gaze, from a driving actor to a source identity. Our
complete pipeline consists of three components: a canonical 3D key-point
estimator that regresses 3D pose and expression-related deformations, a gaze
estimation network and a generator that is built upon the architecture of
HeadGAN. We further experiment with an extension of our generator to
accommodate few-shot learning using an attention mechanism, in case more than
one source image is available. Compared to the latest models for reenactment
and motion transfer, our system achieves higher photo-realism combined with
superior identity preservation, while offering explicit gaze control.
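To make the role of canonical 3D key-points more concrete, here is a minimal sketch of how posed key-points could be assembled from a source identity's canonical points plus a driving actor's expression deformation and head pose. The formula (add the deformation in canonical space, then apply a rigid rotation and translation), the Euler-angle axis order and the names `rotation_matrix` and `drive_keypoints` are assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """Compose a 3x3 rotation from Euler angles in radians (axis order assumed)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    r_yaw = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])    # about y
    r_pitch = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])  # about x
    r_roll = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])   # about z
    return r_roll @ r_yaw @ r_pitch

def drive_keypoints(canonical, deformation, rotation, translation):
    """Pose (K, 3) key-points: deform in canonical space, then rotate and shift.

    canonical   : (K, 3) key-points of the source identity (pose-free)
    deformation : (K, 3) expression offsets taken from the driving frame
    rotation    : (3, 3) head rotation taken from the driving frame
    translation : (3,)   head translation taken from the driving frame
    """
    # Row-vector form of p' = R (p + d) + t, applied to every key-point.
    return (canonical + deformation) @ rotation.T + translation

# Usage with placeholder values: source identity, driving pose and expression.
source_canonical = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
driving_deformation = np.full((3, 3), 0.01)
posed = drive_keypoints(source_canonical, driving_deformation,
                        rotation_matrix(0.3, -0.1, 0.05), np.array([0.0, 0.0, 2.0]))
```

Keeping identity in the canonical points and taking only pose and deformation from the driving frame is one way to separate who is shown from how they move.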
HeadGAN: one-shot neural head synthesis and editing
Recent attempts to solve the problem of head reenactment using a single reference image have shown promising results. However, most of them either perform poorly in terms of photo-realism, or fail to preserve identity, or do not fully transfer the driving pose and expression. We propose HeadGAN, a novel system that conditions synthesis on 3D face representations, which can be extracted from any driving video and adapted to the facial geometry of any reference image, disentangling identity from expression. We further improve mouth movements by utilising audio features as a complementary input. The 3D face representation enables HeadGAN to be further used as an efficient method for compression and reconstruction and a tool for expression and pose editing.
Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views
Developing gaze estimation models that generalize well to unseen domains and
in-the-wild conditions remains a challenge with no known best solution. This is
mostly due to the difficulty of acquiring ground truth data that cover the
distribution of possible faces, head poses and environmental conditions that
exist in the real world. In this work, we propose to train general gaze
estimation models based on 3D geometry-aware gaze pseudo-annotations which we
extract from arbitrary unlabelled face images, which are abundantly available
on the internet. Additionally, we leverage the observation that head, body and
hand pose estimation benefit from being reformulated as dense 3D coordinate
prediction, and similarly express gaze estimation as regression of dense 3D eye
meshes. We overcome the absence of compatible ground truth by fitting rigid 3D
eyeballs on existing gaze datasets and design a multi-view supervision
framework to balance the effect of pseudo-labels during training. We test our
method on the task of gaze generalization, in which we demonstrate improvement
over the state of the art both when no ground truth data are available and
when they are. The project material will become
available for research purposes.
Comment: 13 pages, 12 figures
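The abstract above expresses gaze estimation as regression of dense 3D eye meshes. As a minimal sketch of how a gaze direction could be read off such a mesh, the code below takes the unit vector from the eyeball centre to the iris centre and converts it to pitch and yaw. Approximating the centres by vertex means, the sign convention of the angles (a camera looking along negative z) and the names `gaze_from_eye_mesh` and `vector_to_pitch_yaw` are assumptions for illustration, not the procedure of the paper.

```python
import numpy as np

def gaze_from_eye_mesh(eyeball_vertices, iris_vertices):
    """Unit gaze vector from eyeball centre to iris centre.

    Both inputs are (N, 3) arrays. Using vertex means as centres is a crude
    approximation that assumes roughly uniform vertex sampling.
    """
    centre = eyeball_vertices.mean(axis=0)
    iris_centre = iris_vertices.mean(axis=0)
    direction = iris_centre - centre
    return direction / np.linalg.norm(direction)

def vector_to_pitch_yaw(gaze):
    """Convert a unit gaze vector to (pitch, yaw) in radians.

    Assumed convention: g = (-cos(pitch) sin(yaw), -sin(pitch), -cos(pitch) cos(yaw)),
    so a gaze of (0, 0, -1) gives zero pitch and zero yaw.
    """
    pitch = np.arcsin(np.clip(-gaze[1], -1.0, 1.0))
    yaw = np.arctan2(-gaze[0], -gaze[2])
    return pitch, yaw

# Usage with a toy eye: six eyeball vertices and an iris ring facing negative z.
centre, radius = np.array([1.0, 2.0, 3.0]), 12.0
eyeball = centre + radius * np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                                      [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
iris = centre + np.stack([6.0 * np.cos(theta), 6.0 * np.sin(theta),
                          -np.sqrt(radius**2 - 6.0**2) * np.ones_like(theta)], axis=-1)
gaze = gaze_from_eye_mesh(eyeball, iris)
pitch, yaw = vector_to_pitch_yaw(gaze)
```

Because the direction is derived from geometry rather than predicted as two free angles, the same mesh output can be compared against rigid eyeballs fitted to existing gaze datasets.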